With today's digital revolution, many people communicate and collaborate
in cyberspace. Users rely on social media platforms, such as Facebook,
YouTube and Twitter, all of which exert a considerable impact on human
lives. In particular, watching videos has become preferable to simply
browsing the internet for many reasons. However, difficulties arise
when searching accurately for specific videos within the same domains, such as
entertainment, politics, education, video and TV shows. This problem can be
addressed through web video categorization (WVC) approaches that utilize
video textual information, visual features, or audio features. However,
retrieving videos with similar content with high accuracy remains
challenging. Therefore, this paper proposes a novel model for enhancing
WVC that is based on user comments and weighted features from video
descriptions. Specifically, this model uses supervised learning, with
machine learning classifiers (MLCs) and deep learning (DL) models. Two
experiments are conducted on the proposed balanced dataset on the basis of
the two proposed algorithms for four classes, namely, education,
politics, health and sports. The model achieves high accuracy rates of 97%
and 99% by using MLCs and DL models that are based on artificial neural
network (ANN) and long short-term memory (LSTM), respectively.
1. INTRODUCTION


The convenient accessibility and speed of the internet have made it a staple tool for many people. The
most noticeable and rapidly growing video platforms are Dailymotion and YouTube.
YouTube is known as the largest repository of videos and is widely used for video sharing by billions of
users [1]-[4]. However, given the massive number of videos on the web, users face difficulties in accurately
retrieving the videos they need [5], [6]. The best method to examine, extract and classify web
videos on the basis of content similarity is web video categorization (WVC) [7]-[10]. As the number of
videos on the web has increased exponentially, the traditional way of manually categorizing videos
has become time consuming and requires much effort [11]. Even with software
applications for categorization purposes, human intervention is sometimes necessary for refining
the categorization. Therefore, extensive effort is spent on automatic WVC that can help
improve video retrieval accuracy and retrieve videos with high content similarity. The similarity of user
queries is also used to increase user satisfaction with viewing relevant and required videos.
Existing research has focused on WVC that uses classification [7], [12]-[17] or clustering
techniques [18]-[21], and on surveys [9], [10], [14], [22], [23]. Categorizing web videos is generally based on
visual, audio or textual information. In visual categorization, the main focus is to extract video frames and
treat them as images. The extracted features, such as faces, objects, colors and shapes, are used in the
comparison and classification processes. Audio-based features are extracted from videos; these features are
sound signals, such as music, loudness and pitch, whose values are used in the classification
process. For example, the sound of music is different from the sound of speech. Moreover, a male voice is
different from a female voice. Perceptual features, such as music and violent words, also differ from each other.
Finally, in textual categorization, authors use the textual information of video titles, video descriptions or
their metadata. The combination of visual-based and audio-based categorization can result in satisfactory
improvement. However, WVC remains a massive challenge in computer vision and machine learning [20], [21].

People share their thoughts, ideas, beliefs, daily activities, experiences, entertainment, feelings and
academic knowledge in the form of comments [24], [25]. Users comment on, like and dislike videos to
express their ideologies. These comments are considered unstructured data that can be relevant or irrelevant
to the video content [26]. Relevant comments can be useful for further processing in WVC, particularly on
platforms such as YouTube. Therefore, the current work explores the existing methods and techniques for
WVC. In addition, this study proposes a novel model called the enhanced multiclass web video
categorization model (EMVC). The proposed EMVC enhances the way in which WVC is conducted by
extracting and utilizing user comments and weighted features from video descriptions using machine learning
and deep learning (DL) approaches as a form of supervised learning. In addition, this work examines the
machine learning classifiers (MLCs) and DL models for the proposed algorithms by using the proposed
dataset. The dataset was collected from four types of YouTube videos, namely, sports, health, education and
politics, as predefined classes. A total of 86 videos with 42,668 user comments and video descriptions were
used. The dataset is called the Arabic multi-classification dataset (AMCD) and is publicly available in [27].
AMCD was subjected to several steps, including annotation, noise removal, data cleaning and data
pre-processing, followed by model building and model evaluation. After the completion of the pre-processing
steps, the dataset was reduced to 8,046 user comments and was thus considered balanced. Two distinct
experiments were conducted using MLCs and DL models on the basis of the two proposed algorithms. These
algorithms utilized the textual information extracted from user comments and video descriptions to extract
informative features. These features are given weights based on term frequency-inverse document frequency
(TF-IDF), and the average and maximum TF-IDF weights of user comments are assigned to the video
description. The model showed good accuracies of 97% and 99% using MLCs and DL models that were
based on artificial neural network (ANN) and long short-term memory (LSTM), respectively.

The main contributions of this work are as follows: i) it explores the existing techniques for WVC and
highlights the importance of using user comments and video metadata to enhance WVC; ii) it proposes the
EMVC, which is based on video descriptions and user comments, to enhance WVC through MLCs and DL models;
iii) it introduces a new dataset (AMCD) that is based on the Arabic dialect collected from 86 YouTube videos
with 8,046 user comments; iv) it proposes a novel mathematical equation for improving WVC through two
scenarios and two proposed algorithms that utilize the average and maximum TF-IDF weights of
user comments assigned to the video descriptions; and v) it examines the importance of using user comments
and video descriptions in video classification.

The rest of the paper is organized as follows: section 2 explains the proposed methods, model architecture
and design. Section 3 presents the proposed mathematical equation, while the experiments, results and
discussion are presented in section 4. Finally, the conclusion of this paper is described in section 5.


2. PROPOSED METHOD 

This section presents the methods and system architecture of the proposed model for WVC, as
shown in Figure 1. The system architecture consists of six main interrelated phases, namely, data acquisition,
pre-processing, term extraction and word representation, term weighting, classification methods, and model
evaluation.


2.1. Data acquisition phase 

The first level in the model is known as the input phase. In this phase, data are collected from
YouTube videos for use in the video categorization process. In line with the core objective of this research,
the enhanced video categorization is based upon four predefined classes: health, sport, politics, and
education. The items to extract are the video description and the video comments.
Arabic videos and their associated Arabic comments are used in the experiments on the condition that the
video was published between 2015 and 2020 and received more than 2000 comments. Python 3.6 and
YouTube are used for data collection. During the extraction of the video description and user
comments, the required information is written into a single file for each video. In addition, some of the
attributes in the file are removed. The output of this phase is used as the input to the pre-processing phase.


[Figure 1 panels: dataset acquisition phase; pre-processing phase (NLP processing); term extraction phase
(UC, VD); word representation phase (TF and TF-IDF of UC and VD); term weighting phase (matrix with
maximum or average TF-IDF, proposed algorithms); classification methods phase (machine learning, deep
learning); model evaluation phase (test dataset, model evaluation)]

Figure 1. EMVC architecture 


2.2. Pre-processing phase 

There are three steps in the pre-processing phase: data cleaning, annotation, and
data pre-processing. In data cleaning, duplicate records are removed first, followed by any
English comments, numbers, or tags. The data in each file, which belongs to one video, are cleaned. The
annotation process is carried out in the second step with the help of three native Arabic annotators in
computer science. All three annotators are scholars with PhDs. In this process, if two annotators agree that
a video belongs to one of the four classes, the video is assigned to that
predefined class. Otherwise, comments that are unclear or ambiguous are removed. During
the annotation process, the class label for health is “1”, for education “2”, for politics “3” and
for sport “4”. After the annotation process, the files are merged into one single file. In the data
pre-processing step, Python 3.6 is utilized to pre-process the dataset automatically. Several steps are
performed using regular expressions, such as the removal of any HTML tags, numbers, English characters,
character extensions, and repeated characters. Porter stemming was used to obtain the root of the words.
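The regular-expression cleaning steps listed above could look roughly like the following sketch. The exact patterns used to build AMCD are not given in the text, so every pattern here, the function name `clean_comment`, and the choice to collapse a run of three or more identical characters into one are assumptions made for illustration; stemming is left out.

```python
import re

# Hypothetical cleaning rules; the patterns actually used for AMCD are not stated.
HTML_TAG = re.compile(r"<[^>]+>")
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")   # Western and Arabic-Indic digits
LATIN = re.compile(r"[A-Za-z]+")              # English characters
TATWEEL = re.compile(r"\u0640+")              # Arabic character extension (kashida)
REPEATS = re.compile(r"(.)\1{2,}")            # the same character three or more times
SPACES = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Strip tags, numbers, Latin letters and elongation from one comment."""
    text = HTML_TAG.sub(" ", text)
    text = DIGITS.sub(" ", text)
    text = LATIN.sub(" ", text)
    text = TATWEEL.sub("", text)
    text = REPEATS.sub(r"\1", text)            # keep a single copy of the character
    return SPACES.sub(" ", text).strip()
```

Comments that become empty after cleaning (for example, comments written entirely in English) can then be dropped before annotation.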


2.3. Term extraction and word representation phase 

In this phase, the terms are extracted from both the video description (VD) and user comments (UC),
as described in definition 1 and definition 2. For each video, the combination of terms present in the user
comments and the video description is called the word representation (WR). The comments are transformed
into a vector representation using “TfidfTransformer” in “sklearn”. For each video, the set of transformed
comments is called the vector representation, denoted by VR. The set of vector representations is called a
data collection, denoted by DC, which contains the sets of comments for all videos.
- Definition 1. Given a video $v \in V$, the set of user comments $UC^{v}$ and video description $VD^{v}$ for $v$ is
defined:

$\{UC^{v} + VD^{v}\}$
- Definition 2. Given n videos, the set of word representation (WR) is defined:

$WR = \{\{UC^{1} + VD^{1}\}, \{UC^{2} + VD^{2}\}, \{UC^{3} + VD^{3}\}, \dots, \{UC^{n} + VD^{n}\}\}$
- Definition 3. Given n videos, the set of comments, called the vector representation (VR), is defined:

$VR = \{WR^{1}, WR^{2}, WR^{3}, \dots, WR^{n}\}$
- Definition 4. Given n videos, the set of vector representations, called the data collection (DC), is defined:

$DC = \{VR^{1}, VR^{2}, VR^{3}, \dots, VR^{n}\}$
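As a minimal sketch of the vectorization step, the following uses scikit-learn's `CountVectorizer` followed by `TfidfTransformer`. The English toy sentences stand in for one video's comments and description; pairing the two classes this way and the variable names are assumptions, since the paper names only `TfidfTransformer`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy stand-ins for one video's word representation: its comments plus its description.
user_comments = ["great lesson on algebra", "the teacher explains algebra well"]
video_description = "algebra lesson for beginners"
word_representation = user_comments + [video_description]

counts = CountVectorizer().fit_transform(word_representation)  # raw term counts
tfidf = TfidfTransformer().fit_transform(counts)               # TF-IDF weights

print(tfidf.shape)  # (number of documents, vocabulary size)
```

Each row of `tfidf` is the vector for one comment or for the description, and the columns share one vocabulary, which is what later allows comment weights to be compared with description weights term by term.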


2.4. Term weighting phase 

In the term weighting phase, the proposed mathematical formula is based on
TF-IDF. The TF-IDF weights are extracted from user comments and from YouTube metadata, specifically
the video description only. The formula is shown in (1):

$W(w, C) = TF(w)_{C} \times \log \frac{N}{CF(T)}$ (1)

where $TF(w)_{C}$ denotes the number of occurrences of word (w) in comment (C), $CF(T)$ denotes the number of
comments containing word (w), and $N$ denotes the total number of comments in the dataset.
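Equation (1) can be computed directly from token counts. The sketch below assumes whitespace tokenization and the natural logarithm, neither of which is specified in the text; `tfidf_weights` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_weights(comments):
    """Return one {word: weight} dict per comment, following W = TF * log(N / CF)."""
    tokenized = [comment.split() for comment in comments]
    n_comments = len(tokenized)
    # CF: number of comments that contain each word at least once.
    comment_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # TF: raw count of the word in this comment
        weights.append({
            word: tf * math.log(n_comments / comment_freq[word])
            for word, tf in term_freq.items()
        })
    return weights
```

A word that appears in every comment receives weight zero under this formula, because log(N / CF) is then log(1).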


2.5. Classification methods phase 

In this phase, two types of classification methods were used: classical machine learning
classifiers (MLCs) and deep learning. The classical machine learning classifiers are k-nearest neighbours
(KNN), naive Bayes (NB), decision trees (DT), random forest (RF), support vector machine (SVM) and
logistic regression (LR).


2.5.1. Naive Bayes (NB) 

The naive Bayes (NB) classifier is a parametric classifier.
It is a simple probabilistic classifier based on Bayes' theorem in statistics. It solves
the classification problem by assigning data points to class labels under an independence assumption. The
following formula has been used in the proposed model:

$P(B \mid A) = \frac{P(A \mid B) \times P(B)}{P(A)}$ (2)

where B is the collection of text in a specific class or classes, with B={Education, Health, Sport and
Politics}, and A is the word or comment, with A={User comments and Video Descriptions}. $P(B \mid A)$ is the
probability that the word or comment (A) belongs to class (B). $P(A \mid B)$ is the probability of the word or
comment (A) in the specific class (B).
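To make (2) concrete, the sketch below applies Bayes' theorem to a single word across the four classes. The priors and likelihoods are made-up numbers chosen only for illustration, and `posterior` is a hypothetical helper rather than part of the proposed model.

```python
def posterior(likelihoods, priors):
    """Apply Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A) for every class B."""
    joint = {label: likelihoods[label] * priors[label] for label in priors}
    evidence = sum(joint.values())  # P(A), summed over all classes
    return {label: value / evidence for label, value in joint.items()}


# Made-up numbers for illustration only: equal priors and P(word | class).
priors = {"education": 0.25, "health": 0.25, "politics": 0.25, "sport": 0.25}
likelihoods = {"education": 0.01, "health": 0.01, "politics": 0.02, "sport": 0.06}
scores = posterior(likelihoods, priors)
predicted = max(scores, key=scores.get)
```

With these numbers the word is assigned to the class with the largest posterior; a full naive Bayes classifier multiplies such likelihoods over all words in a comment under the independence assumption.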


2.5.2. K-nearest neighbours (KNN) 

KNN is a non-parametric classification algorithm. It classifies the dataset based on the distance
between data points, measured with a similarity or distance function such as Euclidean distance, Manhattan
distance, cosine similarity, chi-square and correlation. KNN assigns a data point to the category of its
closest neighbours, so points separated by smaller distances fall into the same category. The parameter k is
the number of nearest neighbours consulted when assigning a data point to a group; when k is odd, the
neighbours vote and the majority class is chosen.


2.5.3. Decision tree (DT) 

The decision tree is a non-parametric machine learning classifier. It uses a tree structure that
consists of a root, children/internal nodes and leaf nodes. The dataset is split based on
thresholds and conditions from the tree root until the tree leaves are reached. Each internal node represents
a test on the features, while each leaf represents a decision and class label, as shown in Figure 2. In
the decision tree classifier, the root is selected from the extracted features in DC; to do this,
the feature with the highest information gain must be identified. This can be calculated using
(4), which is based on the entropy calculated using (3), where $P_i$ is the probability that the features in
the data collection ($C_{all}$) belong to class i, and m is the number of classes.

$Info(C_{all}) = -\sum_{i=1}^{m} P_i \log_2(P_i)$ (3)

$Information\ Gain(feature) = Info(C_{all}) - Info_{feature}(C_{all})$ (4)
Figure 2. Decision tree (DT)
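The entropy in (3) and the information gain in (4) can be sketched in a few lines. The text does not define the feature-conditioned term, so this sketch assumes the usual reading: the weighted entropy of the class labels after splitting on the feature's values. Function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Info(C) = -sum_i P_i * log2(P_i) over the class labels present."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder
```

The feature with the largest gain is the one placed at the root; the same computation is repeated on each subset to choose the internal nodes.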


2.5.4. Random forest (RF) 

Random forest (RF) contains several DTs, as shown in Figure 3. The dataset is divided randomly
among the DTs, and samples can be duplicated across DTs. The final result of the model is based on the
majority vote of the outcomes of the DTs. In addition, a large number of DTs increases the model accuracy.
Figure 3. Random forest (RF)


2.5.5. Support vector machine (SVM) 

SVM is mainly used for binary classification problems. It divides the data points in the
multidimensional space into two classes based on the support vectors, which are the points closest to the
hyperplane. In this case, which is multi-class classification, SVM breaks the problem down into binary
classification problems using one of two main approaches: one-vs-one or one-vs-rest.
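The two decomposition strategies differ in how many binary problems they create. The short sketch below enumerates them for the four classes used in this study; it only lists the binary tasks and does not train any SVM.

```python
from itertools import combinations

classes = ["education", "health", "politics", "sport"]

# One-vs-one: one binary problem per unordered pair of classes.
one_vs_one = list(combinations(classes, 2))

# One-vs-rest: one binary problem per class, against all remaining classes.
one_vs_rest = [(c, [other for other in classes if other != c]) for c in classes]

print(len(one_vs_one), len(one_vs_rest))  # 6 4
```

For K classes, one-vs-one trains K(K-1)/2 classifiers on smaller subsets, while one-vs-rest trains K classifiers on the full dataset.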


2.5.6. Deep learning model 

For deep learning, this study used an ANN model known as the multilayer perceptron (MLP) [28]
to classify the proposed data and to examine the model performance. Generally, the deep learning
model consists of three types of layers, namely, input, hidden and output layers. In this study, the input
layer receives the maximum or average TF-IDF features. The output layer consists of four neurons for the
four classes (education, health, politics and sport). The ANN is shown in Figure 4.
Figure 4. Artificial neural network architecture


2.6. Model evaluation 

To evaluate the model performance, two of the most popular methods have been used: the confusion
matrix, as shown in Figure 5, and cross-validation. The confusion matrix has been used to evaluate
the model performance in terms of accuracy. From the confusion matrix, the precision, recall, F-score and
accuracy are calculated using (5)-(8):


$Precision = \frac{TP}{TP + FP}$ (5)

$Recall = \frac{TP}{TP + FN}$ (6)

$F1\text{-}score = \frac{2 \times (Precision \times Recall)}{Precision + Recall}$ (7)

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (8)


where, TP is true positive, TN is true negative, FP is false positive and FN is false negative. 
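Equations (5)-(8) translate directly into code. In the sketch below the counts are made-up values for one class treated as positive against the other three, and the helper name `binary_metrics` is an assumption; the function does not guard against zero denominators.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1-score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# Made-up counts for one class treated as "positive" against the other three.
precision, recall, f1, accuracy = binary_metrics(tp=80, tn=290, fp=10, fn=20)
```

In the multi-class setting, the per-class precision, recall and F-score are obtained by repeating this computation with each class in turn as the positive class.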


Figure 5. Confusion matrix


In addition, cross-validation was performed on the proposed dataset to examine the model on
the introduced dataset. The process was carried out with 3, 5 and 10 folds. In all the experiments, the
difference between the validation and training results was only 1-3%, which
suggests no overfitting or underfitting issues.
3. MATHEMATICAL FORMULA AND PROPOSED ALGORITHMS
This section explains the mathematical equations and the proposed algorithms used in WVC based
on the introduced dataset. The dataset is a collection of unstructured data that consists of video descriptions
and user comments. The collection of comments is denoted by $C_{all}$, and $A_{all}$ represents the collection of
video descriptions that need to be classified into one of the four classes. Both $C_{all}$ and $A_{all}$ are used to
extract the feature representations. The proposed feature representation can be described by the following
definitions.
- Definition 1. There exists a user comment (UC) and a collection of user comments ($C_{all}$), where $C_{all}$
represents all features in C. Therefore, $C_{all} = UC^{1}, UC^{2}, UC^{3}, \dots, UC^{n}$, where n is the total number
of comments, can be defined:

$\exists\, UC \in C_{all}$

- Definition 2. There exists a video description ($VD_{all}$), where $A_{all}$ consists of one or several videos.
Therefore, $VD_{all} = VD^{1}, VD^{2}, VD^{3}, \dots, VD^{m}$, where m is the total number of videos, can be defined:

$\exists\, VD \in VD_{all}$

- Definition 3. There also exists a set of extracted features ($FC_{all}$) from (UC) in $C_{all}$ that can be
represented as $FC_{all} = FC^{1}, FC^{2}, FC^{3}, \dots, FC^{r}$, where FC is a word/term and r is the number of
features, can be defined:

$FC_{all} = \{\exists\, FC \in FC_{all} \wedge FC_{all} \in C \wedge C \in C_{all}\}$

- Definition 4. There also exists a set of extracted features ($FA_{all}$) from (VD) in $A_{all}$ that can be
represented as $FA_{all} = FA^{1}, FA^{2}, FA^{3}, \dots, FA^{s}$, where FA is a word/term and s is the number of
features, can be defined:

$FA_{all} = \{\exists\, FA \in FA_{all} \mid FA_{all} \in VD \wedge VD \in VD_{all}\}$

- Definition 5. The maximum value of TF-IDF of the extracted features ($FC_{all\_TF\text{-}IDF}$) from the collection
of comments $C_{all}$, denoted by MaxFC, is defined in (9).

$MaxFC = \{\forall\, FC \in FC_{all} \mid \max(FC_{all\_TF\text{-}IDF})\}$ (9)

- Definition 6. The maximum TF-IDF value of the extracted features (FC) is assigned to $FA_{all}$ if it is
greater than TF-IDF (FA), denoted by $FA_{all\_TF\text{-}IDF}$, as defined in (10).

$FA_{all\_TF\text{-}IDF} = Max(TF\text{-}IDF(FA_{all}) \vee MaxFC)$ (10)

- Definition 7. The average TF-IDF value of the extracted features ($FC_{all}$) is assigned to $FA_{all}$ if it is
greater than TF-IDF ($FA_{all}$), denoted by $FA_{all\_TF\text{-}IDF}$, as defined in (11).

$FA_{all\_TF\text{-}IDF} = Average(TF\text{-}IDF(FA_{all}) \vee MaxFC)$ (11)

- Definition 8. Based on the value obtained and assigned to $FA_{all\_TF\text{-}IDF}$, the matrix representation is
defined in (12).

$Matrix\ Representation = FC_{all\_TF\text{-}IDF} \cup FA_{all\_TF\text{-}IDF}$ (12)

Based on the aforementioned equations, there are two scenarios. The first scenario applies algorithm 1,
which assigns the average TF-IDF of the user comments to the extracted features (terms) of the video
description, while the second scenario applies algorithm 2, which assigns the maximum TF-IDF of the user
comments to the extracted features (terms) of the video description.


4. EXPERIMENTS, RESULTS AND DISCUSSION 
This section presents the experimental results of the proposed models and algorithms using
classical MLCs and DL models. The dataset is described in the following subsection.
4.1. Dataset

The dataset used in the experiments is textual data collected from YouTube videos,
comprising metadata and user comments. It consists of four classes, namely, health, education, politics and
sport, and the total number of comments is 8,046 after the pre-processing steps. The maximum length of a
comment is 1235 in the politics class, while the minimum length is one in three of the classes. The detailed
description of the dataset is shown in Table 1.


Algorithm 1: Matrix representation for TF-IDF for user comments and average weighted feature of video
description

INPUT:
    User comments (UC)
    Video description (VD)
    C_all denotes the collection of comments
    A_all denotes the collection of video descriptions
OUTPUT: Matrix representation (TF-IDF (C_all and A_all))
BEGIN
    INT FC_all_TF-IDF, FA_all_TF-IDF;
    CHAR C_all, A_all, UC, VD;
    For each UC and VD Do
        C_all = UC++;
        A_all = VD++;
    End;
    For each feature in C_all and A_all Do
        FC_all_TF-IDF = TF-IDF(C_all);
        FA_all_TF-IDF = TF-IDF(A_all);
    End;
    For each feature in A_all Do
        If Average(FC_all_TF-IDF) > FA_all_TF-IDF Then
            FA_all_TF-IDF = Average(FC_all_TF-IDF);
        End If;
    End;
    Matrix_Representation = FC_all_TF-IDF + FA_all_TF-IDF;
END;


Algorithm 2: Matrix representation for TF-IDF for user comments and maximum weighted feature of video
description

INPUT:
    User comments (UC)
    Video description (VD)
    C_all denotes the collection of comments
    A_all denotes the collection of video descriptions
OUTPUT: Matrix representation (TF-IDF (C_all and A_all))
BEGIN
    INT FC_all_TF-IDF, FA_all_TF-IDF;
    CHAR C_all, A_all, UC, VD;
    For each UC and VD Do
        C_all = UC++;
        A_all = VD++;
    End;
    For each feature in C_all and A_all Do
        FC_all_TF-IDF = TF-IDF(C_all);
        FA_all_TF-IDF = TF-IDF(A_all);
    End;
    For each feature in A_all Do
        If Max(FC_all_TF-IDF) > FA_all_TF-IDF Then
            FA_all_TF-IDF = Max(FC_all_TF-IDF);
        End If;
    End;
    Matrix_Representation = FC_all_TF-IDF + FA_all_TF-IDF;
END;
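One possible reading of the two algorithms is sketched below with NumPy. It assumes that comments and the description share one vocabulary (same column order), that the average or maximum is taken per term over all comments, and that the final matrix is formed by appending the reweighted description row to the comment rows; none of these details is fixed by the pseudocode, and `weight_description` is a hypothetical name.

```python
import numpy as np


def weight_description(comment_tfidf, description_tfidf, mode="max"):
    """Raise description weights to the per-term average or maximum comment weight.

    comment_tfidf: array of shape (n_comments, n_terms)
    description_tfidf: array of shape (n_terms,), same term order (an assumption)
    """
    reducer = np.max if mode == "max" else np.mean  # algorithm 2 vs algorithm 1
    comment_weight = reducer(comment_tfidf, axis=0)
    # Keep the description weight unless the comment-derived weight is greater.
    weighted = np.where(comment_weight > description_tfidf,
                        comment_weight, description_tfidf)
    # One possible reading of "FC + FA": append the description row to the comment rows.
    return np.vstack([comment_tfidf, weighted])


comments = np.array([[0.2, 0.0, 0.9],
                     [0.6, 0.0, 0.1]])
description = np.array([0.5, 0.3, 0.2])
avg_matrix = weight_description(comments, description, mode="average")
max_matrix = weight_description(comments, description, mode="max")
```

In this toy input, the second term keeps its original description weight under both modes because no comment weight exceeds it, while the third term is raised in both cases.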

Table 1. Dataset description
Item   Description   Class Val.   Comments Number   Percentage   Max. Length   Min. Length
1      Education     1            2001              24.8%        99            1
2      Health        2            2021              25.2%        970           2
3      Politics      3            2017              25.1%        1235          1
4      Sport         4            2007              24.9%        119           1
Total                             8046              100%
4.2. MLCs experiments

In this section, the experiment based on classical machine learning classifiers was carried out using
the most common classifiers: KNN, SVM, NB, LR, DT, SGD, and RF. This experiment was performed using
Python 3.6 with the aforementioned pre-processing steps. The model performance with and without the
proposed algorithms was measured using the confusion matrix in terms of precision, recall, F-score, and
accuracy.

Four experiments were carried out: an experiment based on MLCs alone,
an experiment with MLCs and algorithms 1 and 2 applied with 30 features, an experiment with MLCs and
algorithms 1 and 2 applied with 40 features, and an experiment with MLCs and algorithms 1 and 2 applied with
50 features. In the first experiment, MLCs were applied to the proposed dataset using N-grams in the form of
bigrams and trigrams without the proposed algorithms. In addition, only the user
comments were utilized in this experiment. The aim was to examine the model performance
before using the proposed algorithms; the results are shown in Table 2, which compares the
performance of the MLCs in the first experiment. The model
accuracy using LR and SGD reached 87% and 88%, respectively, with bigrams. However, no improvement
was recorded using trigrams for any of the MLCs, even though the experiments were repeated several times.
Consequently, all the subsequent experiments were carried out using bigrams only.


Table 2. Results of MLC without algorithm 1 and 2

               Bigram                                       Trigram
MLCs   Class   Precision     Recall   F-score   Accuracy    Precision   Recall   F-score   Accuracy
KNN    1       74%           61%      66%       62%         73%         63%      67%       63%
       2       55%           87%      67%                   55%         88%      68%
       3       47%           71%      56%                   44%         85%      58%
       4       [illegible]   49%      58%                   80%         47%      59%
SVM    1       85%           88%      86%       87%         85%         87%      86%       87%
       2       90%           93%      91%                   90%         93%      91%
       3       87%           90%      89%                   87%         91%      89%
       4       86%           78%      82%                   86%         78%      82%
NB     1       75%           93%      83%       83%         73%         94%      82%       83%
       2       87%           87%      87%                   87%         87%      87%
       3       80%           89%      84%                   80%         89%      84%
       4       90%           67%      77%                   90%         67%      77%
LR     1       87%           88%      87%       88%         87%         88%      87%       88%
       2       91%           92%      92%                   91%         92%      92%
       3       88%           91%      89%                   88%         91%      90%
       4       85%           81%      83%                   86%         81%      83%
DT     1       38%           97%      54%       26%         38%         97%      54%       56%
       2       54%           96%      70%                   54%         96%      69%
       3       36%           94%      52%                   36%         94%      52%
       4       99%           36%      53%                   99%         36%      53%
SGD    1       87%           87%      87%       87%         88%         86%      87%       [illegible]
       2       91%           93%      92%                   91%         93%      92%
       3       88%           89%      89%                   88%         91%      89%
       4       86%           82%      84%                   85%         82%      84%
RF     1       84%           87%      85%       86%         85%         86%      85%       86%
       2       89%           91%      90%                   87%         91%      89%
       3       86%           88%      87%                   86%         88%      87%
       4       85%           78%      81%                   85%         79%      82%


In the second experiment, 30 features were extracted from the video description, and MLCs were applied
with algorithm 1 and algorithm 2. The outcome of this experiment was compared with the
results of the first experiment to measure the improvement in model performance due to the
proposed algorithms. In this experiment, algorithm 1, which uses the average TF-IDF,
outperformed the first experiment in terms of accuracy, as shown in Figure 6.
Additionally, the results show that algorithm 2 recorded a significant
improvement compared with algorithm 1, as shown in Figure 7.

In the third experiment, the number of extracted features for MLCs with algorithms 1 and 2 applied was
increased to 40. As a result, the model recorded higher accuracy than with 30 features, as
shown in Figure 8. The highest accuracy, recorded using RF, reached 97%, whereas KNN recorded the
lowest. Again, the model performance using algorithm 2 outperformed that using algorithm 1.

The fourth experiment examined the model performance with 50 features extracted
from the video description for MLCs with algorithms 1 and 2 applied. All the MLCs achieved their highest
accuracy compared with all the aforementioned experiments. RF, NB and SGD reached 99% in terms of accuracy,
as shown in Figure 9. Overall, based on the results of the four experiments using the MLCs with the
proposed algorithms applied, the highest accuracy was attained using TF-IDF with algorithm 2 and 50 features.
The experiments were repeated several times with more features; however, the accuracy did not improve.
Figure 6. The model performance between normal and algorithm 2

Figure 7. Accuracy of MLCs (max. and average TF-IDF of 30 features)

Figure 8. Accuracy of MLCs (max. and average TF-IDF of 40 features)

Figure 9. Accuracy of MLCs (max. TF-IDF of 50 features)


4.3. Deep learning models

This section explains the second type of experiment, which is conducted using deep neural network
models based on MLP on the proposed dataset. The aim is to evaluate the proposed model and algorithms using
deep learning models and to compare the results and model performance with those of the MLCs. Two model
architectures, ANN and LSTM, have been applied using the two proposed algorithms.

In the ANN experiment, the model is built from four layers using Keras. The
first (input) layer uses an input dimension of 1000 and an output of 128 neurons, with
“relu” as the activation function. The second and the third layers are hidden. Their output shapes consist of
64 neurons and 32 neurons with dropout (0.5), each using “relu” as an activation function. The output layer
consists of four neurons using “softmax” as an activation function. The optimizer used is ‘Adam’ with a
learning rate of 0.001. The loss function is ‘sparse_categorical_crossentropy’ and accuracy is the training
performance metric. The model is trained on 70% of the dataset for 20 epochs with a batch size of 64. Both
proposed algorithms were applied, and the training and validation loss decreased, as shown in
Figures 10 and 11 for algorithm 1 and Figures 12 and 13 for algorithm 2. The model achieved high
performance in the training and validation process in terms of accuracy. In the test phase, which uses 30% of
the dataset, the model achieved testing accuracies of approximately 93% and 99%
using algorithm 1 and algorithm 2, respectively. In addition, the experiments were repeated with the same
configurations and different learning rates, as shown in Table 3.
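The ANN configuration described above can be expressed in Keras roughly as follows. This is a sketch under assumptions rather than the authors' code: the text does not say where the dropout layers sit (here one follows each hidden layer), and the commented training call uses hypothetical variable names.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Layer sizes follow the description above; the dropout placement is an assumption.
model = keras.Sequential([
    keras.Input(shape=(1000,)),             # TF-IDF feature vector of length 1000
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4, activation="softmax"),  # one neuron per class
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# Hypothetical training call; x_train, y_train, x_val and y_val are placeholders:
# model.fit(x_train, y_train, epochs=20, batch_size=64, validation_data=(x_val, y_val))
```

With this loss, the integer labels must lie in the range 0 to 3, so class values coded as 1 to 4 would need to be shifted by one before training.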
Figure 10. Training and validation loss-algorithm 1

Figure 11. Training and validation accuracy-algorithm 1

Figure 12. Training and validation loss-algorithm 2

Figure 13. Training and validation accuracy-algorithm 2
Table 3. Experiment results of Adam optimizer with different learning rate
Learning Rate   Algorithm   Loss     Training Accuracy   Validation Accuracy   Testing Accuracy
0.01            1           0.0430   0.98                0.9172                0.9171
                2           0.0100   0.9972              0.9962                0.9962
0.001           1           0.0351   0.9878              0.9307                0.9307
                2           0.0065   0.9981              0.9974                0.9973
0.0001          1           0.4638   0.8474              0.8584                0.8584
                2           0.0812   0.9798              0.9962                0.9962
In the LSTM experiment, the same dataset is used with a different model architecture. The LSTM
model architecture consists of a stack of three LSTM layers. In the first layer, the output shape
includes 128 LSTM units, with an input dimension of 5,392 user comments, 500 features and 50
features of the video description. The second layer uses the same hyperparameters with 64 LSTM units, while
the third has 32 LSTM units. The output layer has four units and uses ‘softmax’ as the activation function.
The batch size is 64 and the number of training iterations is 20 epochs. The optimizer used is ‘Adam’
with a learning rate of 0.001. The loss function is ‘sparse_categorical_crossentropy’ and accuracy is the
training performance metric. The training and validation loss decreased steadily over the
20 epochs, as shown in Figures 14 and 15 for algorithm 1 and Figures 16 and 17 for algorithm 2.
Figure 14. Training and validation loss (algorithm 1)

Figure 15. Training and validation accuracy (algorithm 1)

Figure 16. Training and validation loss (algorithm 2)

Figure 17. Training and validation accuracy (algorithm 2)


4.4. Results discussion 

The majority of related studies focus on WVC based on video and audio methodologies. Textual
information plays a significant role in achieving a high accuracy rate in WVC, because many people use
social media as a platform to express their opinions by commenting on videos. In addition, the majority
of the benchmark datasets, such as MCG-WEBV and YouTube-8M, are based on the English language and
contain images, video, audio, and metadata. This study focuses on enhancing multi-class WVC based on
combined textual information, namely user comments and video descriptions. It proposes two algorithms that
compute the TF-IDF weights of user comments and assign their average or maximum to the features extracted
from the video description. To evaluate the model performance, we conducted experiments based on machine
learning and deep learning methods. In terms of accuracy, the results obtained with the proposed
algorithm 1 and algorithm 2 outperform those of the most closely related classification approaches. The
comparison is made mainly against existing approaches based on textual information, as well as against
other approaches from the literature. In [29], the accuracy reached 80% through a combination of three
approaches that utilize video and audio. In [30], model performance reached 64% for visual information
extracted from frames using a VSM classifier. The work in [31] also highlighted the importance of using
sentiment analysis in retrieving data, with the model achieving 75.43% accuracy.

In our experiments, we use multi-class WVC with four classes (M=4), namely, sports, economics,
health and education, and the introduced dataset is based on the Arabic language. For the machine learning
classifiers, the highest accuracies with algorithm 2 are 97%, 93% and 96%, obtained with 30 features for
the RF, NB and stochastic gradient descent (SGD) classifiers, respectively. With algorithm 1, the
accuracies reach 98%, 95% and 97% given 40 features for the RF, NB and SGD classifiers, respectively. KNN
achieves the lowest accuracies of 79% and 83% when algorithm 1 is applied to 30 and 40 features,
respectively. In the machine learning experiments, the RF, NB and SGD classifiers always outperform the
other classifiers. In the deep learning experiments, ANN and LSTM are applied. Both models produce
accuracies of up to 99% with only a few layers, neurons and iterations (epochs). The loss also decreases
within a few iterations, and a learning rate of 0.001 performs better than 0.01 and 0.0001. Generally,
this result reflects the importance of utilizing user comments and video descriptions as informative
features for enhancing WVC. Specifically, the proposed algorithms calculate the average or maximum TF-IDF
weights of user comments and assign them to the features extracted from the video descriptions. In
addition, 30, 40 and 50 extracted features of video descriptions are used in this study, with the
category comprising 50 features achieving the best performance.
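
The weighting idea discussed above can be sketched as follows. This is one possible reading, not the
proposed algorithms themselves: the smoothed IDF formula, the function names, the tokenized English toy
comments, and the rule that a feature absent from all comments receives 0.0 are all assumptions made for
illustration. Which of the two modes corresponds to algorithm 1 or algorithm 2 is not assumed here.

```python
import math
from collections import Counter


def tfidf_per_comment(comments):
    """Return one {term: TF-IDF} dictionary per tokenized comment.

    Uses a smoothed IDF, log((1 + n) / (1 + df)) + 1, chosen here as an assumption.
    """
    n = len(comments)
    doc_freq = Counter(term for comment in comments for term in set(comment))
    weights = []
    for comment in comments:
        counts = Counter(comment)
        weights.append({
            term: (count / len(comment))
            * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


def weight_description_features(features, comments, mode="average"):
    """Assign each description feature the average or maximum TF-IDF it has in the comments."""
    per_comment = tfidf_per_comment(comments)
    weighted = {}
    for term in features:
        scores = [w[term] for w in per_comment if term in w]
        if not scores:
            weighted[term] = 0.0  # feature never appears in the comments
        elif mode == "max":
            weighted[term] = max(scores)
        else:
            weighted[term] = sum(scores) / len(scores)
    return weighted


if __name__ == "__main__":
    # Toy, already-tokenized comments and description features (hypothetical data).
    comments = [["goal", "match", "goal"], ["match", "team"], ["exam", "school"]]
    features = ["goal", "match", "vaccine"]
    print(weight_description_features(features, comments, mode="average"))
    print(weight_description_features(features, comments, mode="max"))
```

In this reading, the maximum emphasizes a term that is prominent in at least one comment, while the average
rewards terms that are consistently prominent across the comments that mention them.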


5. CONCLUSION 

This paper focuses on enhancing video categorization based on user comments and video
descriptions. WVC methods fall into three groups: visual-based, audio-based, and textual-based.
Hybrid methods, such as those combining video-based and audio-based approaches, have received more
attention in scholarly articles. The proposed model utilizes user comments and YouTube video metadata,
specifically the video description, to enhance WVC. Four experiments are carried out using MLCs and DL
models with the proposed datasets and two algorithms. TF-IDF features extracted from the video description
are used in three categories of 30, 40, and 50 features. The results of these experiments show that the
combined use of user comments and video descriptions outperforms standard methods that focus purely on
comments. The third category, with 50 extracted features, recorded the highest model performance in
terms of accuracy.